So yeah, I have a difficult task, because I'm going to give a short introductory talk, and then there will be a later talk, I think by Borjan and Domènec, that will perhaps cover similar topics.
We'll also talk about optimal transport, so it's a bit of a patchwork of several ideas, but the key message is how to use techniques from PDEs, modeling tokens as distributions, in order to gain some understanding of very deep transformers. This viewpoint was initially introduced in the PhD thesis of Michael Sander (I forgot to put his picture here), and then I will mention two contributions: one with Takashi and Maarten on expressivity, and one with Valérie and Pierre on smoothness and the PDE side.
Just to explain very briefly what attention mechanisms are; I think Borjan will probably go over this as well.
This is basically the heart of most recent architectures, and the key idea, I would say, is that you start by tokenizing the data.
This is an example for text, but you have vision transformers for images, transformers for proteins, and so on; transformers are really used everywhere and are basically replacing traditional neural networks.
So you tokenize the data: you split it into groups of letters, and each group of letters is encoded as a vector, to which you also add information about the position.
So each token models some group of letters in the text plus its position in the text.
You can do the same for images: you chunk the image into small patches.
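To make this concrete, here is a minimal toy sketch of that pipeline (character-level groups, random embeddings, sinusoidal position codes; all names and choices here are purely illustrative, not the tokenizer of any real model):

```python
import numpy as np

def tokenize(text, group_size=3):
    """Split the text into groups of `group_size` characters."""
    return [text[i:i + group_size] for i in range(0, len(text), group_size)]

def embed(tokens, dim=16, seed=0):
    """Map each token to a vector and add a sinusoidal positional code."""
    rng = np.random.default_rng(seed)
    vocab = {tok: rng.standard_normal(dim) for tok in set(tokens)}
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    out = []
    for pos, tok in enumerate(tokens):
        pos_code = np.zeros(dim)
        pos_code[0::2] = np.sin(pos * freqs)   # position enters through sines/cosines
        pos_code[1::2] = np.cos(pos * freqs)
        out.append(vocab[tok] + pos_code)      # token content + position
    return np.stack(out)                       # one point in R^dim per token

X = embed(tokenize("transformers model tokens as point clouds"))
print(X.shape)   # (n_tokens, 16): the input is now a collection of points
```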
So it means that the data set, or each input, is now a collection of points.
This is really a paradigm change, because it means that the neural network operates on a possibly infinite-dimensional space: the number of tokens can be arbitrarily large, and in practice it is very large, and when you use the model to generate new tokens you can generate arbitrarily long sequences.
So I think this is a major paradigm shift, and the key question that I want to, not really address, but at least put forward, is how to prove theorems in infinite-dimensional spaces for these neural architectures.
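One common way to make "a collection of points of arbitrary size" precise, in the spirit of the distribution viewpoint mentioned above (the notation below is only a sketch of that viewpoint, not taken from the slides), is to encode the tokens as an empirical probability measure:

```latex
% n tokens x_1,\dots,x_n \in \mathbb{R}^d encoded as an empirical measure
\mu \;=\; \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}\;\in\;\mathcal{P}(\mathbb{R}^d),
\qquad\text{so that a layer acts as a map }\;
\Gamma \,:\, \mathcal{P}(\mathbb{R}^d)\;\longrightarrow\;\mathcal{P}(\mathbb{R}^d).
```

The space of probability measures does not depend on the number of tokens, which is one way to phrase the infinite-dimensional nature of the problem.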
So how does it go? It basically alternates a single set of layers, which are always the same, and there are three parts in a transformer.
The first one is the normalization layer, which I will not speak about; I think Borjan will probably cover it. It means you project each token onto the sphere.
It is very important, because attention is very sensitive to the norm of the tokens, but in my talk I will basically not speak about this.
Then there is the attention mechanism, which is the big novelty, so I will spend a lot of time on it.
But what is very important is that you also have traditional MLPs, small neural networks that operate on each token independently: each token is processed separately.
Of course this is very old, so I will not speak about it, but in practice it is also very important.
In fact, when you look at DeepSeek, the famous model, most of the parameters are in the MLPs.
So the MLPs are the part that carries the largest number of parameters: if you remove them it does not work, but of course they are just plain old MLPs, so I will not say much about them.
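Putting the three parts together, here is a rough single-head sketch of such a layer (sphere-projection normalization, attention, token-wise MLP, with residual connections; this is only an illustration of the structure, not any production implementation):

```python
import numpy as np

def normalize(X):
    """Project every token onto the unit sphere (the normalization layer)."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def attention(X, Q, K, V):
    """Single-head self-attention: each token attends to all tokens."""
    scores = (X @ Q) @ (X @ K).T / np.sqrt(Q.shape[1])    # (n, n) affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # softmax over neighbors
    return weights @ (X @ V)

def mlp(X, W1, b1, W2, b2):
    """Two-layer MLP applied to each token independently (row-wise)."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def transformer_layer(X, params):
    """Normalization -> attention -> token-wise MLP, with residual connections."""
    Q, K, V, W1, b1, W2, b2 = params
    X = X + attention(normalize(X), Q, K, V)
    X = X + mlp(normalize(X), W1, b1, W2, b2)
    return X

# Usage: 10 tokens in dimension 16, random parameters just to check shapes.
rng = np.random.default_rng(0)
d, h = 16, 32
params = (rng.standard_normal((d, d)), rng.standard_normal((d, d)),
          rng.standard_normal((d, d)), rng.standard_normal((d, h)),
          np.zeros(h), rng.standard_normal((h, d)), np.zeros(d))
print(transformer_layer(rng.standard_normal((10, d)), params).shape)  # (10, 16)
```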
So what is attention at a high level?
It is a mechanism that takes each token and moves it to a new location, but this move is conditioned on the neighboring tokens.
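In formulas, for standard single-head self-attention with a residual connection (the textbook form, written here in generic notation rather than the speaker's), each token is displaced towards a softmax-weighted average of the value vectors of all tokens:

```latex
x_i \;\longmapsto\; x_i \;+\; \sum_{j=1}^{n}
\frac{\exp\!\big(\langle Q x_i,\, K x_j\rangle\big)}
     {\sum_{k=1}^{n}\exp\!\big(\langle Q x_i,\, K x_k\rangle\big)}\; V x_j .
```

The weights depend on all the other tokens, which is exactly the "conditioned on the neighbors" part: the same token moves differently depending on the context it sits in.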
Presenter
Prof. Gabriel Peyré
Access
Open access
Duration
00:28:24 min
Recording date
2025-04-29
Uploaded on
2025-04-29 16:13:30
Language
en-US
• Alessandro Coclite. Politecnico di Bari
• Fariba Fahroo. Air Force Office of Scientific Research
• Giovanni Fantuzzi. FAU MoD/DCN-AvH, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Borjan Geshkovski. Inria, Sorbonne Université
• Paola Goatin. Inria, Sophia-Antipolis
• Shi Jin. SJTU, Shanghai Jiao Tong University
• Alexander Keimer. Universität Rostock
• Felix J. Knutson. Air Force Office of Scientific Research
• Anne Koelewijn. FAU MoD, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Günter Leugering. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Lorenzo Liverani. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Camilla Nobili. University of Surrey
• Gianluca Orlando. Politecnico di Bari
• Michele Palladino. Università degli Studi dell’Aquila
• Gabriel Peyré. CNRS, ENS-PSL
• Alessio Porretta. Università di Roma Tor Vergata
• Francesco Regazzoni. Politecnico di Milano
• Domènec Ruiz-Balet. Université Paris Dauphine
• Daniel Tenbrinck. FAU, Friedrich-Alexander-Universität Erlangen-Nürnberg
• Daniela Tonon. Università di Padova
• Juncheng Wei. Chinese University of Hong Kong
• Yaoyu Zhang. Shanghai Jiao Tong University
• Wei Zhu. Georgia Institute of Technology